Machine Learning: AllLife Bank Personal Loan Campaign¶
Problem Statement¶
Context¶
AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
Objective¶
To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target more.
Data Dictionary¶
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIP Code: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
Importing necessary libraries¶
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
)
# To ignore unnecessary warnings
import warnings
warnings.filterwarnings("ignore")
# to define a common seed value to be used throughout
RS=0
Loading the dataset¶
#from google.colab import drive
#drive.mount('/content/drive')
Before loading the full dataset, let's check the column names to see whether there is a ZIP-code-like column that should be read as a string rather than as a number.
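One lightweight way to peek at the header is `pd.read_csv` with `nrows=0`, which parses only the column names. The sketch below uses a tiny in-memory sample (hypothetical rows) so it is self-contained; on the real file, pass `"Loan_Modelling.csv"` in place of the buffer.

```python
import io

import pandas as pd

# A tiny in-memory sample standing in for Loan_Modelling.csv;
# with the real file, pass its path instead of this buffer.
sample = io.StringIO("ID,Age,ZIPCode\n1,25,91107\n")

# nrows=0 parses only the header row, so we can inspect column
# names (and spot 'ZIPCode') before committing to dtypes.
header = pd.read_csv(sample, nrows=0)
print(header.columns.tolist())  # ['ID', 'Age', 'ZIPCode']
```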
# loading the dataset, but bring in the ZIPCode as a string
loan_data = pd.read_csv("Loan_Modelling.csv", dtype={'ZIPCode': 'str'})
# copying the data to another variable to avoid any changes to original data
df = loan_data.copy()
Data Overview¶
- Observations
- Sanity checks
Review the first few rows¶
# viewing the first 5 rows of the data
df.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
Check the shape of the dataset¶
df.shape
(5000, 14)
Observation
- The dataset has 5000 rows and 14 columns
Check the data types of the columns for the dataset¶
# checking datatypes and number of non-null values for each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   object
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(12), object(1)
memory usage: 547.0+ KB
- All of the columns in the data are numeric, except for ZIPCode that we explicitly brought in as a string.
Check for missing values¶
# checking for missing values
df.isnull().sum()
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64
Observation
- There are no missing values
Check for duplicate values¶
# checking the number of unique values in each column
df.nunique()
ID                    5000
Age                     45
Experience              47
Income                 162
ZIPCode                467
Family                   4
CCAvg                  108
Education                3
Mortgage               347
Personal_Loan            2
Securities_Account       2
CD_Account               2
Online                   2
CreditCard               2
dtype: int64
Observation
- There are no obvious duplicate rows (possibly because the data has a unique ID column), so let's also check whether the rows are unique once ID is excluded
Let's check more closely to see if rows have the same data
df[df[['Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']].duplicated() == True]
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Observation
- None of the rows have the exact same data
Examine zipcode data¶
We can derive additional location data from the ZIPCode column. Let's begin by extracting the state.
!pip install pyzipcode
Collecting pyzipcode
Downloading pyzipcode-3.0.1.tar.gz (1.9 MB)
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: pyzipcode
Building wheel for pyzipcode (setup.py): started
Building wheel for pyzipcode (setup.py): finished with status 'done'
Created wheel for pyzipcode: filename=pyzipcode-3.0.1-py3-none-any.whl size=1932204 sha256=039f2a4ccea653f238ba699578cf4f47a44a2a692cf8feac6d1541ebb7c2c090
Stored in directory: c:\users\bruce\appdata\local\pip\cache\wheels\ab\f5\51\28e2517ce97289ebabfda69345b275acb17cd1be9444715b5c
Successfully built pyzipcode
Installing collected packages: pyzipcode
Successfully installed pyzipcode-3.0.1
from pyzipcode import ZipCodeDatabase

# build the ZIP-code lookup database once and reuse it in both helpers
zcdb = ZipCodeDatabase()

def get_state(zipcode):
    """Return the state for a ZIP code, or 'XX' if it is not in the database."""
    try:
        return zcdb[zipcode].state
    except KeyError:
        return 'XX'

def get_city(zipcode):
    """Return the city for a ZIP code, or 'Unknown' if it is not in the database."""
    try:
        return zcdb[zipcode].city
    except KeyError:
        return 'Unknown'
df['State'] = df.loc[df['ZIPCode'].notnull(), 'ZIPCode'].apply(lambda x: get_state(x))
df.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | State | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | CA |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | CA |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | CA |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | CA |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | CA |
Check for errors
# print the bad zipcodes
print('bad zipcodes', df.loc[df['State'] == 'XX', 'ZIPCode'].unique())
# row count with errors
print('bad zipcode rows', df.loc[df['State'] == 'XX', 'ZIPCode'].count())
bad zipcodes ['92717' '93077' '92634' '96651']
bad zipcode rows 34
- There are 34 rows containing one of four invalid ZIP codes
Upon investigation, ZIP codes beginning with:
- 927 are likely California
- 930 are likely California
- 926 are likely California
- 966 are likely California military addresses
Let's mark the few affected rows as California.
df['State'] = df['State'].apply(lambda x: 'CA' if x == 'XX' else x)
## How many of the 5000 rows are in California?
print('California customers', df.loc[df['State'] == 'CA'].shape[0])
California customers 5000
Observation
- All of the customers are in California
Let's create a new column to indicate city.
df['City'] = df.loc[df['ZIPCode'].notnull(), 'ZIPCode'].apply(lambda x: get_city(x))
## How many unique cities?
print('Unique cities', df['City'].nunique())
Unique cities 243
df['City'].unique().tolist()
['Pasadena', 'Los Angeles', 'Berkeley', 'San Francisco', 'Northridge', 'San Diego', 'Claremont', 'Monterey', 'Ojai', 'Redondo Beach', 'Santa Barbara', 'Belvedere Tiburon', 'Glendora', 'Santa Clara', 'Capitola', 'Stanford', 'Studio City', 'Daly City', 'Newbury Park', 'Arcata', 'Santa Cruz', 'Fremont', 'Richmond', 'Mountain View', 'Huntington Beach', 'Sacramento', 'San Clemente', 'Davis', 'Redwood City', 'Cupertino', 'Santa Clarita', 'Roseville', 'Redlands', 'La Jolla', 'Brisbane', 'El Segundo', 'Los Altos', 'Santa Monica', 'San Luis Obispo', 'Pleasant Hill', 'Thousand Oaks', 'Rancho Cordova', 'San Jose', 'Reseda', 'Salinas', 'Cardiff By The Sea', 'Oakland', 'San Rafael', 'Banning', 'Bakersfield', 'Riverside', 'Rancho Cucamonga', 'Alameda', 'Palo Alto', 'Livermore', 'Irvine', 'South San Francisco', 'Emeryville', 'Ridgecrest', 'Unknown', 'Hayward', 'San Gabriel', 'Santa Ana', 'Loma Linda', 'Encinitas', 'Fullerton', 'Agoura Hills', 'San Marcos', 'Fresno', 'Long Beach', 'Milpitas', 'Camarillo', 'Rohnert Park', 'Rosemead', 'Sherman Oaks', 'Seaside', 'Goleta', 'Walnut Creek', 'Menlo Park', 'Albany', 'Torrance', 'Hawthorne', 'Eureka', 'La Mesa', 'Edwards', 'San Ysidro', 'San Leandro', 'Mission Hills', 'Valencia', 'South Lake Tahoe', 'Venice', 'Anaheim', 'Sunnyvale', 'Laguna Niguel', 'Costa Mesa', 'San Ramon', 'Mission Viejo', 'San Bernardino', 'Belmont', 'Moss Landing', 'Bodega Bay', 'Hollister', 'San Pablo', 'La Palma', 'Garden Grove', 'West Sacramento', 'Seal Beach', 'Glendale', 'Chico', 'Lompoc', 'Cypress', 'Manhattan Beach', 'Folsom', 'Sanger', 'Canoga Park', 'Carson', 'Hermosa Beach', 'Vallejo', 'Fallbrook', 'Oceanside', 'Escondido', 'Highland', 'San Mateo', 'Greenbrae', 'Ukiah', 'Chino Hills', 'Chatsworth', 'Antioch', 'Orange', 'Hacienda Heights', 'Fawnskin', 'Novato', 'Pleasanton', 'Baldwin Park', 'San Luis Rey', 'Sylmar', 'Culver City', 'Arcadia', 'Pomona', 'Carlsbad', 'Montebello', 'Tustin', 'March Air Force Base', 'Carpinteria', 'Stockton', 'Lomita', 'Fairfield', 
'Burlingame', 'Beverly Hills', 'Gilroy', 'Placentia', 'Concord', 'San Juan Bautista', 'Laguna Hills', 'Brea', 'Chula Vista', 'San Anselmo', 'Bonita', 'Citrus Heights', 'Ventura', 'Tehachapi', 'Imperial', 'Monterey Park', 'Montague', 'South Pasadena', 'Santa Rosa', 'Monrovia', 'Merced', 'National City', 'Simi Valley', 'Sunland', 'Newport Beach', 'Elk Grove', 'Trinity Center', 'San Bruno', 'Larkspur', 'El Dorado Hills', 'Poway', 'Calabasas', 'Crestline', 'La Mirada', 'Clovis', 'North Hollywood', 'San Juan Capistrano', 'Norwalk', 'Yorba Linda', 'Campbell', 'Los Alamitos', 'Aptos', 'Woodland Hills', 'Montclair', 'Westlake Village', 'Modesto', 'Castro Valley', 'Yucaipa', 'Palos Verdes Peninsula', 'Los Gatos', 'Half Moon Bay', 'Oxnard', 'Oak View', 'North Hills', 'El Sobrante', 'Martinez', 'Inglewood', 'Vista', 'Whittier', 'Rio Vista', 'Saratoga', 'Morgan Hill', 'Portola Valley', 'Redding', 'Sierra Madre', 'Sonora', 'Danville', 'Bella Vista', 'Boulder Creek', 'Lake Forest', 'Ceres', 'Alhambra', 'Chino', 'Pacific Grove', 'Napa', 'Marina', 'Alamo', 'Moraga', 'Hopland', 'Santa Ynez', 'Ben Lomond', 'Van Nuys', 'Capistrano Beach', 'Sausalito', 'Upland', 'Diamond Bar', 'South Gate', 'Clearlake', 'Ladera Ranch', 'Rancho Palos Verdes', 'Pacific Palisades', 'West Covina', 'San Dimas', 'Tahoe City', 'Weed', 'Stinson Beach']
Drop ID column and state column¶
We can drop the ID columns as it does not provide value to the analysis
df.drop(columns=["ID"], inplace=True)
df.drop(columns=["State"], inplace=True)
Check statistical summary¶
# Let's look at the statistical summary of the data
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.0 | 45.0 | 55.0 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.0 | 20.0 | 30.0 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.0 | 64.0 | 98.0 | 224.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.7 | 1.5 | 2.5 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.0 | 0.0 | 101.0 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
Observations
- Age and Experience are roughly symmetric; the average age is 45 and the average experience is 20 years.
- Experience has a minimum of -3, which is impossible and suggests data-entry errors that may need treatment.
- Family size is slightly right-skewed.
- Income is right-skewed.
- CCAvg is right-skewed.
- Most customers do not have a CD account, a securities account, another bank's credit card, or a personal loan.
- About 60% of customers use online banking.
- Mortgage is heavily right-skewed: the median is 0, the 75th percentile is 101 (thousand dollars), and the maximum is 635.
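The skew readings above can be checked numerically with pandas' `.skew()`; on the real data this would be `df[['Income', 'CCAvg', 'Mortgage']].skew()`. A minimal sketch on made-up values with the same general shape:

```python
import pandas as pd

# Toy frame echoing the right-skewed shapes noted above (hypothetical
# values; on the real data run df[['Income', 'CCAvg', 'Mortgage']].skew()).
toy = pd.DataFrame({
    "Income": [8, 39, 64, 98, 224],
    "Mortgage": [0, 0, 0, 101, 635],
})

# .skew() gives a quick numeric check: values well above 0
# confirm the right (positive) skew the histograms suggest.
print(toy.skew())
```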
Exploratory Data Analysis¶
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Questions:
- What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
- How many customers have credit cards?
- What are the attributes that have a strong correlation with the target attribute (personal loan)?
- How does a customer's interest in purchasing a loan vary with their age?
- How does a customer's interest in purchasing a loan vary with their education?
Preparation for EDA¶
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage above each bar
    plt.show()  # show the plot
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    # place the legend outside the plot area
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?¶
histogram_boxplot(df, "Mortgage")
mortgages = df.loc[df['Mortgage'] > 0].shape[0]
print('Customers with mortgages', mortgages)
print('Percentage with mortgages', mortgages/df.shape[0])
Customers with mortgages 1538
Percentage with mortgages 0.3076
Observations
- Most customers do not have a mortgage; those who do appear as outliers in the boxplot.
- 1538 customers (about 31%) have a mortgage.
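The boxplot whiskers follow Tukey's 1.5×IQR convention; a short sketch of that rule on a hypothetical Mortgage-like series (on the real data, substitute `df['Mortgage']`):

```python
import pandas as pd

# Tukey's 1.5*IQR rule, the same convention the boxplot whiskers use.
# Hypothetical Mortgage-like values; on the real data use df['Mortgage'].
s = pd.Series([0, 0, 0, 0, 0, 0, 90, 101, 155, 300, 635])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)  # points above this count as outliers
outliers = s[s > upper]
print(upper, outliers.tolist())  # 320.0 [635]
```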
How many customers have credit cards?¶
labeled_barplot(df, "CreditCard", perc=True)
creditcards = df.loc[df['CreditCard'] > 0].shape[0]
print('Customers with credit cards', creditcards)
print('Percentage with credit cards', creditcards/df.shape[0])
Customers with credit cards 1470
Percentage with credit cards 0.294
Observations
- 70.6% of customers do not have a credit card from another bank
- 1470 customers (29.4%) do
What are the attributes that have a strong correlation with the target attribute (personal loan)¶
cols_list = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(12, 7))
sns.heatmap(
df[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
df.corr(numeric_only=True)
| Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1.000000 | 0.994215 | -0.055269 | -0.046418 | -0.052012 | 0.041334 | -0.012539 | -0.007726 | -0.000436 | 0.008043 | 0.013702 | 0.007681 |
| Experience | 0.994215 | 1.000000 | -0.046574 | -0.052563 | -0.050077 | 0.013152 | -0.010582 | -0.007413 | -0.001232 | 0.010353 | 0.013898 | 0.008967 |
| Income | -0.055269 | -0.046574 | 1.000000 | -0.157501 | 0.645984 | -0.187524 | 0.206806 | 0.502462 | -0.002616 | 0.169738 | 0.014206 | -0.002385 |
| Family | -0.046418 | -0.052563 | -0.157501 | 1.000000 | -0.109275 | 0.064929 | -0.020445 | 0.061367 | 0.019994 | 0.014110 | 0.010354 | 0.011588 |
| CCAvg | -0.052012 | -0.050077 | 0.645984 | -0.109275 | 1.000000 | -0.136124 | 0.109905 | 0.366889 | 0.015086 | 0.136534 | -0.003611 | -0.006689 |
| Education | 0.041334 | 0.013152 | -0.187524 | 0.064929 | -0.136124 | 1.000000 | -0.033327 | 0.136722 | -0.010812 | 0.013934 | -0.015004 | -0.011014 |
| Mortgage | -0.012539 | -0.010582 | 0.206806 | -0.020445 | 0.109905 | -0.033327 | 1.000000 | 0.142095 | -0.005411 | 0.089311 | -0.005995 | -0.007231 |
| Personal_Loan | -0.007726 | -0.007413 | 0.502462 | 0.061367 | 0.366889 | 0.136722 | 0.142095 | 1.000000 | 0.021954 | 0.316355 | 0.006278 | 0.002802 |
| Securities_Account | -0.000436 | -0.001232 | -0.002616 | 0.019994 | 0.015086 | -0.010812 | -0.005411 | 0.021954 | 1.000000 | 0.317034 | 0.012627 | -0.015028 |
| CD_Account | 0.008043 | 0.010353 | 0.169738 | 0.014110 | 0.136534 | 0.013934 | 0.089311 | 0.316355 | 0.317034 | 1.000000 | 0.175880 | 0.278644 |
| Online | 0.013702 | 0.013898 | 0.014206 | 0.010354 | -0.003611 | -0.015004 | -0.005995 | 0.006278 | 0.012627 | 0.175880 | 1.000000 | 0.004210 |
| CreditCard | 0.007681 | 0.008967 | -0.002385 | 0.011588 | -0.006689 | -0.011014 | -0.007231 | 0.002802 | -0.015028 | 0.278644 | 0.004210 | 1.000000 |
Observations
- Personal_Loan has a moderate positive correlation with Income (0.50)
- Weaker positive correlations exist with CCAvg (0.37) and CD_Account (0.32)
The attributes most positively correlated with Personal_Loan are:
- Income
- CCAvg
- CD_Account
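The heatmap reading can be condensed into one ranked Series; on the real data this is `df.corr(numeric_only=True)['Personal_Loan'].drop('Personal_Loan').sort_values(ascending=False)`. A sketch on a toy frame with made-up values:

```python
import pandas as pd

# Toy numeric frame (made-up values) standing in for df; the pattern
# mimics the real data: Income and CCAvg rise with Personal_Loan, Age does not.
toy = pd.DataFrame({
    "Income":        [40, 60, 90, 150, 180],
    "CCAvg":         [0.5, 1.0, 2.0, 4.0, 6.0],
    "Age":           [55, 30, 45, 60, 25],
    "Personal_Loan": [0, 0, 0, 1, 1],
})

# One ranked Series replaces eyeballing the full heatmap.
ranked = (
    toy.corr(numeric_only=True)["Personal_Loan"]
    .drop("Personal_Loan")
    .sort_values(ascending=False)
)
print(ranked)
```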
How does a customer's interest in purchasing a loan vary with their age?¶
df['Age'].corr(df['Personal_Loan'])
-0.007725617173534042
Let's analyze the relationship a bit deeper between age and personal loan¶
distribution_plot_wrt_target(df, "Age", "Personal_Loan")
Observation
- There is essentially no correlation between age and personal loan (-0.008)
How does a customer's interest in purchasing a loan vary with their education?¶
df['Education'].corr(df['Personal_Loan'])
0.13672155003028072
Observation
- There is a weak positive correlation between education and personal loan
Let's analyze the relationship a bit deeper between education and personal loan¶
distribution_plot_wrt_target(df, "Education", "Personal_Loan")
Observation
- However, customers with personal loans tend to have higher education levels
Convert education data type to categorical¶
Education is stored as an int in the data. However, it does not represent a continuous value, such as the number of years in school; rather, it represents a milestone or degree. As such, it should be converted to a categorical type.
df['Degree'] = df['Education'].astype('category')
df['Degree'] = df['Degree'].cat.rename_categories({1: 'Undergrad', 2 : 'Graduate', 3 : 'Advanced'})
df['Degree']
0 Undergrad
1 Undergrad
2 Undergrad
3 Graduate
4 Graduate
...
4995 Advanced
4996 Undergrad
4997 Advanced
4998 Graduate
4999 Undergrad
Name: Degree, Length: 5000, dtype: category
Categories (3, object): ['Undergrad', 'Graduate', 'Advanced']
labeled_barplot(df, "Degree", perc=True)
Observation
- About half of customers hold only an undergraduate degree
- Every customer has at least an undergraduate education, since the lowest category is Undergrad
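A quick way to quantify how loan uptake varies by degree is a grouped mean, i.e. `df.groupby('Degree')['Personal_Loan'].mean()` on the real data. A sketch on a hypothetical mini sample:

```python
import pandas as pd

# Hypothetical mini sample (made-up rows, not the real conversion rates);
# on the real data: df.groupby('Degree')['Personal_Loan'].mean()
toy = pd.DataFrame({
    "Degree":        ["Undergrad", "Undergrad", "Graduate", "Graduate", "Advanced", "Advanced"],
    "Personal_Loan": [0, 0, 0, 1, 1, 1],
})

# Mean of a 0/1 column per group = conversion rate per degree level.
rate = toy.groupby("Degree")["Personal_Loan"].mean()
print(rate)
```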
Additional univariate analysis¶
Here are some additional counts that may be interesting. High percentages of 0 (false columns) identifies untapped markets.
Personal Loan¶
labeled_barplot(df, "Personal_Loan", perc=True)
Observation
- 90.4% of customers do not have personal loans
Securities accounts¶
labeled_barplot(df, "Securities_Account", perc=True)
Observation
- Approximately 90% of customers do not have a securities account
CD Account¶
labeled_barplot(df, "CD_Account", perc=True)
Observation
- 94% of customers do not have a CD account
City (customer location)¶
df['City'].value_counts()
Los Angeles 375
San Diego 269
San Francisco 257
Berkeley 241
Sacramento 148
...
Sausalito 1
Sierra Madre 1
Ladera Ranch 1
Tahoe City 1
Stinson Beach 1
Name: City, Length: 243, dtype: int64
df['City'].value_counts().nlargest(10)
Los Angeles      375
San Diego        269
San Francisco    257
Berkeley         241
Sacramento       148
Palo Alto        130
Stanford         127
Davis            121
La Jolla         112
Santa Barbara    103
Name: City, dtype: int64
top_twenty_cities = df['City'].value_counts().nlargest(20)
plt.bar(top_twenty_cities.index, top_twenty_cities.values)
plt.xlabel("City")
plt.xticks(rotation=85)
plt.show()
Observation
Most customers are in California's largest metro areas. The top cities cluster around:
- Los Angeles: including Pasadena and Irvine
- San Diego: including La Jolla
- San Francisco Bay Area: San Francisco, Berkeley, Oakland, Palo Alto, Stanford, San Jose, Santa Clara, Menlo Park
- Central Coast: Monterey, Santa Cruz, Santa Barbara
Additional bivariate analysis¶
Here's an additional view of the overall correlations. The hue highlights which customers purchased personal loans.
sns.pairplot(data=df, diag_kind="kde", hue="Personal_Loan")
plt.show()
Observations
- Age and Experience are highly correlated and essentially duplicate each other's information.
- There is a relatively high correlation between Income and CCAvg (average credit card spending).
- There is a moderate correlation between CCAvg and Personal_Loan.
The attributes most correlated with Personal_Loan remain:
- Income
- CCAvg
- CD_Account
Let's analyze the relation between mortgage and personal loan¶
distribution_plot_wrt_target(df, "Mortgage", "Personal_Loan")
Observations
- Although most do not have mortgages, there are many who have mortgages but not personal loans.
Let's analyze the relation between income and personal loan¶
distribution_plot_wrt_target(df, "Income", "Personal_Loan")
Observation
- Those with personal loans have a higher income (from 60 for those who do not, to around 135 for those who do)
- Many have high incomes (shown as outliers) who do not have personal loans.
Let's analyze the relations between family and personal loan¶
distribution_plot_wrt_target(df, "Family", "Personal_Loan")
Observations
- Customers with personal loans tend to have larger families (median family size of about 3, versus 2 for those without loans)
Let's analyze the relations between education and personal loan¶
distribution_plot_wrt_target(df, "Degree", "Personal_Loan")
Data Preprocessing¶
- Missing value treatment
- Feature engineering (if needed)
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)
Missing value treatment¶
Check for missing values
# count the missing values
df.isnull().sum()
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
City                  0
Degree                0
dtype: int64
Observation
- There are no null or NaN values in the data
Feature engineering¶
- The `ID` field was removed in a previous step.
- A `State` column was created and removed in a previous step.
- Drop the `ZIPCode` column, now that `State` and `City` have been derived from it.
- Drop the `Education` column and use the `Degree` column instead.
df.drop(['ZIPCode'], axis = 1, inplace = True)
df.drop(['Education'], axis = 1, inplace = True)
df
| Age | Experience | Income | Family | CCAvg | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | City | Degree | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 4 | 1.6 | 0 | 0 | 1 | 0 | 0 | 0 | Pasadena | Undergrad |
| 1 | 45 | 19 | 34 | 3 | 1.5 | 0 | 0 | 1 | 0 | 0 | 0 | Los Angeles | Undergrad |
| 2 | 39 | 15 | 11 | 1 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | Berkeley | Undergrad |
| 3 | 35 | 9 | 100 | 1 | 2.7 | 0 | 0 | 0 | 0 | 0 | 0 | San Francisco | Graduate |
| 4 | 35 | 8 | 45 | 4 | 1.0 | 0 | 0 | 0 | 0 | 0 | 1 | Northridge | Graduate |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 29 | 3 | 40 | 1 | 1.9 | 0 | 0 | 0 | 0 | 1 | 0 | Irvine | Advanced |
| 4996 | 30 | 4 | 15 | 4 | 0.4 | 85 | 0 | 0 | 0 | 1 | 0 | La Jolla | Undergrad |
| 4997 | 63 | 39 | 24 | 2 | 0.3 | 0 | 0 | 0 | 0 | 0 | 0 | Ojai | Advanced |
| 4998 | 65 | 40 | 49 | 3 | 0.5 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles | Graduate |
| 4999 | 28 | 4 | 83 | 3 | 0.8 | 0 | 0 | 0 | 0 | 1 | 1 | Irvine | Undergrad |
5000 rows × 13 columns
df.shape
(5000, 13)
Outlier detection¶
# outlier detection using boxplot
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Observations
- Age and Experience are nearly identical
- Credit card average and mortgage show that most customers have one credit card and no mortgage. The outliers show the rest of the customers that have one.
- Personal loan, securities account, CD account, credit card and online all show that customers either have one or not.
Nothing to change in the outliers. Their data will provide the binary rules for our decision tree.
Data preparation¶
In this step, we prepare the data:
- Separate the target column (`Personal_Loan`) from the features
- Use `get_dummies` for the `City` and `Degree` columns
- Split the data into train and test sets
X = df.drop(["Personal_Loan"], axis=1)
Y = df["Personal_Loan"]
X = pd.get_dummies(X, drop_first=True)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=RS
)
Determine the shape and whether training data and test data are similar
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3500, 254)
Shape of test set :  (1500, 254)
Percentage of classes in training set:
0    0.899429
1    0.100571
Name: Personal_Loan, dtype: float64
Percentage of classes in test set:
0    0.914667
1    0.085333
Name: Personal_Loan, dtype: float64
Observation
- We had seen that around 90% of customers do not have a personal loan, and this class balance is preserved in both the train and test sets
Model Building¶
The goal of the model is to correctly
- Predict whether a customer will buy a personal loan.
- Understand which customer attributes are most significant in driving purchases.
- Identify which segment of customers to target more.
Model Evaluation Criterion¶
The model can make wrong predictions in two ways:
- Predicting a customer will not buy a personal loan when in reality they would (a false negative, FN).
- Predicting a customer will buy a personal loan when in reality they would not (a false positive, FP).
Which case is more important?
- If we predict correctly, the customer takes a personal loan: the bank gains an income stream and the customer has cash available when needed.
- If we predict incorrectly, marketing money is spent on customers who are unlikely to convert.
How to maximize the effort to increase customer value?
- The cost of growing business with existing customers is significantly lower than the cost of acquiring new customers.
- Grow the business at the lowest cost by increasing personal loans among existing customers.
The following code provides methods used to determine the predictions.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
Model Building¶
Decision tree (default)¶
Let's begin by building the decision tree (default)
model0 = DecisionTreeClassifier(random_state=RS)
model0.fit(X_train, y_train)
DecisionTreeClassifier(random_state=0)
Check model performance on the training set¶
confusion_matrix_sklearn(model0, X_train, y_train)
decision_tree_default_perf_train = model_performance_classification_sklearn(
model0, X_train, y_train
)
decision_tree_default_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
Check model performance on test set¶
confusion_matrix_sklearn(model0, X_test, y_test)
decision_tree_default_perf_test = model_performance_classification_sklearn(
model0, X_test, y_test
)
decision_tree_default_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.984 | 0.882812 | 0.92623 | 0.904 |
Observation
- The model fits the training set perfectly and still scores well on the test set, though the drop in recall (1.00 on train vs 0.88 on test) indicates some overfitting.
Decision tree (with class weights)¶
If the frequency of class A is 10% and the frequency of class B is 90%, class B becomes dominant and the decision tree becomes biased toward it.
In this case, we will set class_weight = "balanced", which will automatically adjust the weights to be inversely proportional to the class frequencies in the input data. class_weight is a hyperparameter of the decision tree classifier.
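As a sketch of what "balanced" does (illustrative data mirroring the loan data's roughly 90/10 split, not the bank dataset itself), scikit-learn computes each class weight as n_samples / (n_classes * count(class)):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative 90/10 split, mirroring the loan data's class balance
y_demo = np.array([0] * 90 + [1] * 10)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)

# weight(c) = n_samples / (n_classes * count(c)) -> [100/180, 100/20] = [0.556, 5.0]
print(dict(zip([0, 1], np.round(weights, 3))))
```

The minority class gets a proportionally larger weight, so misclassifying a loan buyer costs the tree more during training.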
model1 = DecisionTreeClassifier(random_state=1, class_weight="balanced")
model1.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', random_state=1)
confusion_matrix_sklearn(model1, X_train, y_train)
decision_tree_perf_train = model_performance_classification_sklearn(
model1, X_train, y_train
)
decision_tree_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
Observations
- Model is able to perfectly classify all the data points on the training set.
- 0 errors on the training set, each sample has been classified correctly.
- As we know a decision tree will continue to grow and classify each data point correctly if no restrictions are applied as the trees will learn all the patterns in the training set.
- This generally leads to overfitting of the model as Decision Tree will perform well on the training set but will fail to replicate the performance on the test set.
confusion_matrix_sklearn(model1, X_test, y_test)
decision_tree_perf_test = model_performance_classification_sklearn(
model1, X_test, y_test
)
decision_tree_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.982 | 0.898438 | 0.891473 | 0.894942 |
Observation
- We have established a baseline model.
- Although results change slightly between the training and test data, the test set still shows Accuracy near 98% and Recall, Precision, and F1 near 90%
Let's use pruning techniques to try and reduce overfitting.
Model Performance Improvement¶
Decision tree (pre-pruning)¶
The next step is to optimize the model’s performance through hyperparameter tuning. Utilizing hyperparameter search techniques with cross-validation is a robust approach to finding the best set of hyperparameters.
Methodology
- Hyperparameter tuning is crucial because it directly affects the performance of a model.
- Unlike model parameters which are learned during training, hyperparameters need to be set before training.
- Effective hyperparameter tuning helps in improving the performance and robustness of the model.
- The search below samples predefined parameter values to identify the best model based on the chosen cross-validation scoring metric.
Goal of this process
Maximize precision and recall, not necessarily improve accuracy.
[From https://towardsdatascience.com/precision-and-recall-a-simplified-view-bc25978d81e6. Emphasis added.]
Models need high recall when you need output-sensitive predictions. For example, predicting cancer or predicting terrorists needs a high recall, in other words, you need to cover false negatives as well. It is ok if a non-cancer tumor is flagged as cancerous but a cancerous tumor should not be labeled non-cancerous.
Similarly, we need high precision in places such as recommendation engines, spam mail detection, etc. Where you don’t care about false negatives but focus more on true positives and false positives. It is ok if spam comes into the inbox folder but a really important mail shouldn’t go into the spam folder.
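A toy calculation with hypothetical confusion-matrix counts (not taken from this model) makes the trade-off concrete:

```python
# Hypothetical confusion-matrix counts, for illustration only
tp, fp, fn, tn = 80, 20, 10, 890

recall = tp / (tp + fn)     # share of actual loan buyers the model caught
precision = tp / (tp + fp)  # share of flagged customers who actually buy

print(f"recall={recall:.3f}, precision={precision:.3f}")  # recall=0.889, precision=0.800
```

Reducing false negatives (fn) raises recall; reducing false positives (fp) raises precision, and tuning typically trades one against the other.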
Select search technique
Randomized search, which samples parameter values at random from specified distributions
# Define hyperparameter distribution for Random Search
param_dist = {
'criterion': ['gini', 'entropy'],
'max_depth': [None] + list(range(10, 31)),
'min_samples_split': range(2, 11),
'min_samples_leaf': range(1, 11)
}
from sklearn.model_selection import RandomizedSearchCV
# Random Search
random_search = RandomizedSearchCV(model1, param_dist, n_iter=100, cv=5, scoring='accuracy')
random_search.fit(X_train, y_train)
best_params_random = random_search.best_params_
best_score_random = random_search.best_score_
print(f'Best Parameters (Random Search): {best_params_random}')
print(f'Best Cross-Validation Score (Random Search): {best_score_random:.2f}')
Best Parameters (Random Search): {'min_samples_split': 4, 'min_samples_leaf': 1, 'max_depth': 19, 'criterion': 'entropy'}
Best Cross-Validation Score (Random Search): 0.99
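The search above is scored on accuracy; since the stated goal emphasizes recall, a variant could score on recall instead. The sketch below uses synthetic stand-in data purely to show the `scoring="recall"` wiring; in the notebook you would pass `X_train`, `y_train`, and the real `param_dist`:

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook would use X_train, y_train and its own param_dist.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 4)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

param_dist_demo = {"max_depth": [2, 3, 4, None], "min_samples_leaf": range(1, 5)}
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1, class_weight="balanced"),
    param_dist_demo,
    n_iter=10,
    cv=5,
    scoring="recall",  # optimize for catching loan buyers rather than raw accuracy
    random_state=1,
)
search.fit(X_demo, y_demo)
```

`best_params_` and `best_score_` are then interpreted exactly as before, except the score is cross-validated recall.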
Evaluate the decision tree
best_model = DecisionTreeClassifier(**best_params_random)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)
print(f'Final Model Accuracy: {final_accuracy:.2f}')
Final Model Accuracy: 0.99
# creating an instance of the best model
model2 = best_model
# fitting the best model to the training data
model2.fit(X_train, y_train)
DecisionTreeClassifier(criterion='entropy', max_depth=19, min_samples_split=4)
confusion_matrix_sklearn(model2, X_train, y_train)
decision_tree_tune_perf_train = model_performance_classification_sklearn(
model2, X_train, y_train
)
decision_tree_tune_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.998857 | 0.991477 | 0.997143 | 0.994302 |
confusion_matrix_sklearn(model2, X_test, y_test)
decision_tree_tune_perf_test = model_performance_classification_sklearn(
model2, X_test, y_test
)
decision_tree_tune_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.987333 | 0.90625 | 0.943089 | 0.924303 |
Observations
- The model is giving a more generalized result now: recall is 0.99 on the train set and 0.91 on the test set, a much smaller gap than the default tree, which shows that the model generalizes reasonably well to unseen data.
- Test-set recall and precision are both above 0.90.
Visualize the decision tree¶
column_names = list(X.columns)
feature_names = column_names
print(feature_names)
['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'City_Alameda', 'City_Alamo', 'City_Albany', 'City_Alhambra', 'City_Anaheim', 'City_Antioch', 'City_Aptos', 'City_Arcadia', 'City_Arcata', 'City_Bakersfield', 'City_Baldwin Park', 'City_Banning', 'City_Bella Vista', 'City_Belmont', 'City_Belvedere Tiburon', 'City_Ben Lomond', 'City_Berkeley', 'City_Beverly Hills', 'City_Bodega Bay', 'City_Bonita', 'City_Boulder Creek', 'City_Brea', 'City_Brisbane', 'City_Burlingame', 'City_Calabasas', 'City_Camarillo', 'City_Campbell', 'City_Canoga Park', 'City_Capistrano Beach', 'City_Capitola', 'City_Cardiff By The Sea', 'City_Carlsbad', 'City_Carpinteria', 'City_Carson', 'City_Castro Valley', 'City_Ceres', 'City_Chatsworth', 'City_Chico', 'City_Chino', 'City_Chino Hills', 'City_Chula Vista', 'City_Citrus Heights', 'City_Claremont', 'City_Clearlake', 'City_Clovis', 'City_Concord', 'City_Costa Mesa', 'City_Crestline', 'City_Culver City', 'City_Cupertino', 'City_Cypress', 'City_Daly City', 'City_Danville', 'City_Davis', 'City_Diamond Bar', 'City_Edwards', 'City_El Dorado Hills', 'City_El Segundo', 'City_El Sobrante', 'City_Elk Grove', 'City_Emeryville', 'City_Encinitas', 'City_Escondido', 'City_Eureka', 'City_Fairfield', 'City_Fallbrook', 'City_Fawnskin', 'City_Folsom', 'City_Fremont', 'City_Fresno', 'City_Fullerton', 'City_Garden Grove', 'City_Gilroy', 'City_Glendale', 'City_Glendora', 'City_Goleta', 'City_Greenbrae', 'City_Hacienda Heights', 'City_Half Moon Bay', 'City_Hawthorne', 'City_Hayward', 'City_Hermosa Beach', 'City_Highland', 'City_Hollister', 'City_Hopland', 'City_Huntington Beach', 'City_Imperial', 'City_Inglewood', 'City_Irvine', 'City_La Jolla', 'City_La Mesa', 'City_La Mirada', 'City_La Palma', 'City_Ladera Ranch', 'City_Laguna Hills', 'City_Laguna Niguel', 'City_Lake Forest', 'City_Larkspur', 'City_Livermore', 'City_Loma Linda', 'City_Lomita', 'City_Lompoc', 'City_Long Beach', 'City_Los 
Alamitos', 'City_Los Altos', 'City_Los Angeles', 'City_Los Gatos', 'City_Manhattan Beach', 'City_March Air Force Base', 'City_Marina', 'City_Martinez', 'City_Menlo Park', 'City_Merced', 'City_Milpitas', 'City_Mission Hills', 'City_Mission Viejo', 'City_Modesto', 'City_Monrovia', 'City_Montague', 'City_Montclair', 'City_Montebello', 'City_Monterey', 'City_Monterey Park', 'City_Moraga', 'City_Morgan Hill', 'City_Moss Landing', 'City_Mountain View', 'City_Napa', 'City_National City', 'City_Newbury Park', 'City_Newport Beach', 'City_North Hills', 'City_North Hollywood', 'City_Northridge', 'City_Norwalk', 'City_Novato', 'City_Oak View', 'City_Oakland', 'City_Oceanside', 'City_Ojai', 'City_Orange', 'City_Oxnard', 'City_Pacific Grove', 'City_Pacific Palisades', 'City_Palo Alto', 'City_Palos Verdes Peninsula', 'City_Pasadena', 'City_Placentia', 'City_Pleasant Hill', 'City_Pleasanton', 'City_Pomona', 'City_Portola Valley', 'City_Poway', 'City_Rancho Cordova', 'City_Rancho Cucamonga', 'City_Rancho Palos Verdes', 'City_Redding', 'City_Redlands', 'City_Redondo Beach', 'City_Redwood City', 'City_Reseda', 'City_Richmond', 'City_Ridgecrest', 'City_Rio Vista', 'City_Riverside', 'City_Rohnert Park', 'City_Rosemead', 'City_Roseville', 'City_Sacramento', 'City_Salinas', 'City_San Anselmo', 'City_San Bernardino', 'City_San Bruno', 'City_San Clemente', 'City_San Diego', 'City_San Dimas', 'City_San Francisco', 'City_San Gabriel', 'City_San Jose', 'City_San Juan Bautista', 'City_San Juan Capistrano', 'City_San Leandro', 'City_San Luis Obispo', 'City_San Luis Rey', 'City_San Marcos', 'City_San Mateo', 'City_San Pablo', 'City_San Rafael', 'City_San Ramon', 'City_San Ysidro', 'City_Sanger', 'City_Santa Ana', 'City_Santa Barbara', 'City_Santa Clara', 'City_Santa Clarita', 'City_Santa Cruz', 'City_Santa Monica', 'City_Santa Rosa', 'City_Santa Ynez', 'City_Saratoga', 'City_Sausalito', 'City_Seal Beach', 'City_Seaside', 'City_Sherman Oaks', 'City_Sierra Madre', 'City_Simi Valley', 
'City_Sonora', 'City_South Gate', 'City_South Lake Tahoe', 'City_South Pasadena', 'City_South San Francisco', 'City_Stanford', 'City_Stinson Beach', 'City_Stockton', 'City_Studio City', 'City_Sunland', 'City_Sunnyvale', 'City_Sylmar', 'City_Tahoe City', 'City_Tehachapi', 'City_Thousand Oaks', 'City_Torrance', 'City_Trinity Center', 'City_Tustin', 'City_Ukiah', 'City_Unknown', 'City_Upland', 'City_Valencia', 'City_Vallejo', 'City_Van Nuys', 'City_Venice', 'City_Ventura', 'City_Vista', 'City_Walnut Creek', 'City_Weed', 'City_West Covina', 'City_West Sacramento', 'City_Westlake Village', 'City_Whittier', 'City_Woodland Hills', 'City_Yorba Linda', 'City_Yucaipa', 'Degree_Graduate', 'Degree_Advanced']
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
model0,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names=True,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model0, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50 | |--- CCAvg <= 2.95 | | |--- weights: [2462.00, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- City_Valencia <= 0.50 | | | | |--- Degree_Graduate <= 0.50 | | | | | |--- City_Banning <= 0.50 | | | | | | |--- City_Santa Clara <= 0.50 | | | | | | | |--- Age <= 62.50 | | | | | | | | |--- City_La Jolla <= 0.50 | | | | | | | | | |--- City_Berkeley <= 0.50 | | | | | | | | | | |--- City_San Francisco <= 0.50 | | | | | | | | | | | |--- weights: [94.00, 0.00] class: 0 | | | | | | | | | | |--- City_San Francisco > 0.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- City_Berkeley > 0.50 | | | | | | | | | | |--- CreditCard <= 0.50 | | | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | | | | |--- CreditCard > 0.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- City_La Jolla > 0.50 | | | | | | | | | |--- Age <= 33.00 | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | | |--- Age > 33.00 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Age > 62.50 | | | | | | | | |--- Family <= 1.50 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | |--- Family > 1.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- City_Santa Clara > 0.50 | | | | | | | |--- Experience <= 26.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Experience > 26.50 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- City_Banning > 0.50 | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Degree_Graduate > 0.50 | | | | | |--- CCAvg <= 3.85 | | | | | | |--- Age <= 36.50 | | | | | | | |--- Income <= 63.50 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- Income > 63.50 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | |--- Age > 36.50 | | | | | | | |--- City_Moss Landing <= 0.50 | | | | | | | | |--- City_Berkeley <= 0.50 | | | | | 
| | | | |--- weights: [19.00, 0.00] class: 0 | | | | | | | | |--- City_Berkeley > 0.50 | | | | | | | | | |--- Experience <= 29.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- Experience > 29.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- City_Moss Landing > 0.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | |--- CCAvg > 3.85 | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | |--- City_Valencia > 0.50 | | | | |--- weights: [0.00, 1.00] class: 1 | | |--- CD_Account > 0.50 | | | |--- City_Oakland <= 0.50 | | | | |--- weights: [0.00, 8.00] class: 1 | | | |--- City_Oakland > 0.50 | | | | |--- weights: [1.00, 0.00] class: 0 |--- Income > 98.50 | |--- Family <= 2.50 | | |--- Degree_Advanced <= 0.50 | | | |--- Degree_Graduate <= 0.50 | | | | |--- Income <= 100.00 | | | | | |--- CCAvg <= 4.20 | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- CCAvg > 4.20 | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | |--- Income > 100.00 | | | | | |--- Income <= 104.50 | | | | | | |--- CCAvg <= 3.31 | | | | | | | |--- weights: [11.00, 0.00] class: 0 | | | | | | |--- CCAvg > 3.31 | | | | | | | |--- CCAvg <= 4.25 | | | | | | | | |--- Mortgage <= 124.50 | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | | |--- Mortgage > 124.50 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- CCAvg > 4.25 | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | |--- Income > 104.50 | | | | | | |--- weights: [442.00, 0.00] class: 0 | | | |--- Degree_Graduate > 0.50 | | | | |--- Income <= 116.50 | | | | | |--- CCAvg <= 2.85 | | | | | | |--- Age <= 28.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Age > 28.50 | | | | | | | |--- weights: [17.00, 0.00] class: 0 | | | | | |--- CCAvg > 2.85 | | | | | | |--- weights: [0.00, 6.00] class: 1 | | | | |--- Income > 116.50 | | | | | |--- weights: [0.00, 58.00] class: 1 | | |--- Degree_Advanced > 0.50 | 
| | |--- Income <= 114.50 | | | | |--- Mortgage <= 250.00 | | | | | |--- CCAvg <= 2.00 | | | | | | |--- weights: [13.00, 0.00] class: 0 | | | | | |--- CCAvg > 2.00 | | | | | | |--- Age <= 48.50 | | | | | | | |--- CCAvg <= 3.70 | | | | | | | | |--- CCAvg <= 2.95 | | | | | | | | | |--- Family <= 1.50 | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | | |--- Family > 1.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- CCAvg > 2.95 | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | |--- CCAvg > 3.70 | | | | | | | | |--- weights: [8.00, 0.00] class: 0 | | | | | | |--- Age > 48.50 | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | |--- Mortgage > 250.00 | | | | | |--- weights: [0.00, 3.00] class: 1 | | | |--- Income > 114.50 | | | | |--- weights: [0.00, 68.00] class: 1 | |--- Family > 2.50 | | |--- Income <= 114.50 | | | |--- CCAvg <= 2.75 | | | | |--- City_Rohnert Park <= 0.50 | | | | | |--- Income <= 106.50 | | | | | | |--- weights: [28.00, 0.00] class: 0 | | | | | |--- Income > 106.50 | | | | | | |--- City_Sacramento <= 0.50 | | | | | | | |--- City_Claremont <= 0.50 | | | | | | | | |--- City_Berkeley <= 0.50 | | | | | | | | | |--- CCAvg <= 2.45 | | | | | | | | | | |--- City_San Francisco <= 0.50 | | | | | | | | | | | |--- weights: [22.00, 0.00] class: 0 | | | | | | | | | | |--- City_San Francisco > 0.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- CCAvg > 2.45 | | | | | | | | | | |--- Income <= 111.00 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | | |--- Income > 111.00 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | |--- City_Berkeley > 0.50 | | | | | | | | | |--- Online <= 0.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- Online > 0.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- City_Claremont > 0.50 | | | | | | | | |--- weights: [0.00, 1.00] 
class: 1 | | | | | | |--- City_Sacramento > 0.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- City_Rohnert Park > 0.50 | | | | | |--- weights: [0.00, 1.00] class: 1 | | | |--- CCAvg > 2.75 | | | | |--- Age <= 59.50 | | | | | |--- City_Los Angeles <= 0.50 | | | | | | |--- City_San Jose <= 0.50 | | | | | | | |--- City_El Segundo <= 0.50 | | | | | | | | |--- weights: [0.00, 18.00] class: 1 | | | | | | | |--- City_El Segundo > 0.50 | | | | | | | | |--- Experience <= 31.00 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | |--- Experience > 31.00 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- City_San Jose > 0.50 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- City_Los Angeles > 0.50 | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | |--- Age > 59.50 | | | | | |--- weights: [5.00, 0.00] class: 0 | | |--- Income > 114.50 | | | |--- Experience <= 40.00 | | | | |--- weights: [0.00, 154.00] class: 1 | | | |--- Experience > 40.00 | | | | |--- Income <= 124.50 | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | |--- Income > 124.50 | | | | | |--- weights: [0.00, 4.00] class: 1
Using the extracted decision rules, you can make interpretations from the decision tree model. For example, if income is greater than 98.5, family size is greater than 2.5, income is greater than 114.5, and experience is at most 40, the customer is predicted to accept a personal loan.
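As an illustrative (not production) encoding, one such branch can be written as a plain predicate; the thresholds are taken from the `export_text` output above, and the function name is our own:

```python
def likely_loan_buyer(income, family, experience):
    """Illustrative predicate for one branch of the tree above (thresholds from export_text).

    income is in thousand dollars, family is household size, experience in years.
    """
    return income > 114.5 and family > 2.5 and experience <= 40

print(likely_loan_buyer(income=120, family=3, experience=20))  # True
```

Branches like this one are what make decision trees directly actionable for the marketing team: each leaf is a human-readable targeting rule.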
Show feature importance¶
importances = model2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
A better view of the top ten features.
importances = model2.feature_importances_
indices = np.argsort(importances)
selected_indices = indices[-10:]
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(selected_indices)), importances[selected_indices], color="violet", align="center")
plt.yticks(range(len(selected_indices)), [feature_names[i] for i in selected_indices])
plt.xlabel("Relative Importance")
plt.show()
Observations of the pre-pruned model
- The City dummy variables did not rank among the key features
- Income, Education, and Family are the top three most important features.
Decision tree (post-pruning)¶
- Cost complexity pruning provides another option to control the size of a tree.
- In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha.
- Greater values of ccp_alpha increase the number of nodes pruned.
- Here we only show the effect of ccp_alpha on regularizing the trees and how to choose the optimal ccp_alpha value.
Total impurity of leaves vs effective alphas of pruned tree
Minimal cost complexity pruning recursively finds the node with the "weakest
link". The weakest link is characterized by an effective alpha, where the
nodes with the smallest effective alpha are pruned first. To get an idea of
what values of ccp_alpha could be appropriate, scikit-learn provides
DecisionTreeClassifier.cost_complexity_pruning_path that returns the
effective alphas and the corresponding total leaf impurities at each step of
the pruning process. As alpha increases, more of the tree is pruned, which
increases the total impurity of its leaves.
clf = DecisionTreeClassifier(random_state=RS, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
pd.DataFrame(path)
| ccp_alphas | impurities | |
|---|---|---|
| 0 | 0.000000e+00 | -8.232532e-17 |
| 1 | 1.022759e-18 | -8.130256e-17 |
| 2 | 1.022759e-18 | -8.027980e-17 |
| 3 | 1.022759e-18 | -7.925704e-17 |
| 4 | 1.022759e-18 | -7.823429e-17 |
| 5 | 1.410703e-18 | -7.682358e-17 |
| 6 | 1.939716e-18 | -7.488387e-17 |
| 7 | 2.045519e-18 | -7.283835e-17 |
| 8 | 3.068278e-18 | -6.977007e-17 |
| 9 | 3.332785e-18 | -6.643728e-17 |
| 10 | 9.651371e-17 | 3.007642e-17 |
| 11 | 5.144127e-16 | 5.444892e-16 |
| 12 | 1.559252e-04 | 3.118503e-04 |
| 13 | 1.559252e-04 | 6.237006e-04 |
| 14 | 1.577287e-04 | 9.391580e-04 |
| 15 | 2.852687e-04 | 2.080233e-03 |
| 16 | 2.956248e-04 | 3.262732e-03 |
| 17 | 3.006446e-04 | 3.563377e-03 |
| 18 | 3.008424e-04 | 3.864219e-03 |
| 19 | 3.062474e-04 | 4.170466e-03 |
| 20 | 3.089794e-04 | 4.788425e-03 |
| 21 | 3.118503e-04 | 5.100276e-03 |
| 22 | 3.126675e-04 | 5.412943e-03 |
| 23 | 3.957119e-04 | 6.995790e-03 |
| 24 | 5.141036e-04 | 7.509894e-03 |
| 25 | 5.317867e-04 | 8.041681e-03 |
| 26 | 5.508954e-04 | 8.592576e-03 |
| 27 | 6.471632e-04 | 1.053407e-02 |
| 28 | 7.728989e-04 | 1.285276e-02 |
| 29 | 1.040966e-03 | 1.597566e-02 |
| 30 | 1.166170e-03 | 1.714183e-02 |
| 31 | 1.186511e-03 | 1.832834e-02 |
| 32 | 1.357213e-03 | 1.968555e-02 |
| 33 | 1.575733e-03 | 2.126129e-02 |
| 34 | 1.632202e-03 | 2.289349e-02 |
| 35 | 1.937502e-03 | 2.676849e-02 |
| 36 | 1.944857e-03 | 3.260306e-02 |
| 37 | 2.498054e-03 | 3.759917e-02 |
| 38 | 2.721651e-03 | 4.032082e-02 |
| 39 | 2.783906e-03 | 4.310473e-02 |
| 40 | 3.242158e-03 | 4.634689e-02 |
| 41 | 3.823594e-03 | 5.017048e-02 |
| 42 | 3.928297e-03 | 5.409878e-02 |
| 43 | 4.276887e-03 | 6.265255e-02 |
| 44 | 1.032213e-02 | 7.297468e-02 |
| 45 | 2.997084e-02 | 1.029455e-01 |
| 46 | 3.454808e-02 | 2.065898e-01 |
| 47 | 2.934102e-01 | 5.000000e-01 |
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using the effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(
random_state=RS, ccp_alpha=ccp_alpha, class_weight="balanced"
)
clf.fit(X_train, y_train)
clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.29341024918428205
For the remainder, we remove the last element in
clfs and ccp_alphas, because it is the trivial tree with only one
node. Here we show that the number of nodes and tree depth decreases as alpha increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
pred_train = clf.predict(X_train)
values_train = recall_score(y_train, pred_train)
recall_train.append(values_train)
recall_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = recall_score(y_test, pred_test)
recall_test.append(values_test)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# creating the model where we get highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.004276886837404323, class_weight='balanced',
random_state=0)
model4 = best_model
confusion_matrix_sklearn(model4, X_train, y_train)
decision_tree_post_perf_train = model_performance_classification_sklearn(
model4, X_train, y_train
)
decision_tree_post_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.926286 | 1.0 | 0.577049 | 0.731809 |
confusion_matrix_sklearn(model4, X_test, y_test)
decision_tree_post_test = model_performance_classification_sklearn(
model4, X_test, y_test
)
decision_tree_post_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.928 | 1.0 | 0.542373 | 0.703297 |
Observation
- The post-pruned tree also generalizes well on recall: 1.0 on both the train and test data, which shows that the model catches every loan buyer on unseen data. Note, however, that precision drops to about 0.54 on the test set, so nearly half of the customers flagged would not take a loan.
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
model4,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model4, feature_names=feature_names, show_weights=True))
|--- Income <= 96.50 | |--- CCAvg <= 2.95 | | |--- weights: [1360.86, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- weights: [68.38, 99.43] class: 1 |--- Income > 96.50 | |--- Family <= 2.50 | | |--- Degree_Advanced <= 0.50 | | | |--- Degree_Graduate <= 0.50 | | | | |--- Income <= 104.50 | | | | | |--- weights: [12.23, 29.83] class: 1 | | | | |--- Income > 104.50 | | | | | |--- weights: [245.71, 0.00] class: 0 | | | |--- Degree_Graduate > 0.50 | | | | |--- weights: [10.56, 323.15] class: 1 | | |--- Degree_Advanced > 0.50 | | | |--- weights: [13.90, 382.81] class: 1 | |--- Family > 2.50 | | |--- weights: [38.36, 914.77] class: 1
Observation
As expected, pruning removed the City dummy variables, which contributed little to the model.
importances = model4.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Let's examine just the top ten features.
importances = model4.feature_importances_
indices = np.argsort(importances)
selected_indices = indices[-10:]
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(selected_indices)), importances[selected_indices], color="violet", align="center")
plt.yticks(range(len(selected_indices)), [feature_names[i] for i in selected_indices])
plt.xlabel("Relative Importance")
plt.show()
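Beyond the bar chart, it can be handy to inspect the same ranking as a table. A short sketch using illustrative stand-in values (in the notebook you would use `model4.feature_importances_` and the real `feature_names` list):

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins for model4.feature_importances_ / feature_names
feature_names = ["Income", "Family", "CCAvg", "Degree_Graduate", "ZIPCode_90210"]
importances = np.array([0.55, 0.20, 0.15, 0.10, 0.00])

imp_df = (
    pd.DataFrame({"feature": feature_names, "importance": importances})
    .sort_values("importance", ascending=False)
    .head(10)  # same "top ten" cut as the plot
    .reset_index(drop=True)
)
print(imp_df)
```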
Model Comparison and Final Model Selection¶
# training performance comparison
models_train_comp_df = pd.concat(
[
decision_tree_default_perf_train.T,
decision_tree_perf_train.T,
decision_tree_tune_perf_train.T,
decision_tree_post_perf_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree (sklearn default)",
"Decision Tree with class_weight",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree (sklearn default) | Decision Tree with class_weight | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|---|
| Accuracy | 1.0 | 1.0 | 0.998857 | 0.926286 |
| Recall | 1.0 | 1.0 | 0.991477 | 1.000000 |
| Precision | 1.0 | 1.0 | 0.997143 | 0.577049 |
| F1 | 1.0 | 1.0 | 0.994302 | 0.731809 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
decision_tree_default_perf_test.T,
decision_tree_perf_test.T,
decision_tree_tune_perf_test.T,
decision_tree_post_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree (sklearn default)",
"Decision Tree with class_weight",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| | Decision Tree (sklearn default) | Decision Tree with class_weight | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|---|
| Accuracy | 0.984000 | 0.982000 | 0.987333 | 0.928000 |
| Recall | 0.882812 | 0.898438 | 0.906250 | 1.000000 |
| Precision | 0.926230 | 0.891473 | 0.943089 | 0.542373 |
| F1 | 0.904000 | 0.894942 | 0.924303 | 0.703297 |
- The training and test sets produced similar results for each model.
- The pre-pruned decision tree gives high recall and precision on both the training and test sets.
- High recall combined with high precision is the most desirable property here, since it limits both missed buyers and wasted offers.
- Post-pruning achieves perfect recall but suffers a significant drop in precision (0.54 on the test set).
- Therefore, we choose the pre-pruned tree as our best model.
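The selection above can also be automated from the comparison table, e.g. by picking the model with the highest test-set F1. A sketch with the metrics copied from the table (in the notebook you would use `models_test_comp_df` directly):

```python
import pandas as pd

# Test-set metrics copied from the comparison table above
models_test_comp_df = pd.DataFrame(
    {
        "Decision Tree (sklearn default)": [0.984, 0.882812, 0.926230, 0.904000],
        "Decision Tree with class_weight": [0.982, 0.898438, 0.891473, 0.894942],
        "Decision Tree (Pre-Pruning)": [0.987333, 0.906250, 0.943089, 0.924303],
        "Decision Tree (Post-Pruning)": [0.928, 1.0, 0.542373, 0.703297],
    },
    index=["Accuracy", "Recall", "Precision", "F1"],
)

best_by_f1 = models_test_comp_df.loc["F1"].idxmax()
print(best_by_f1)  # the pre-pruned tree has the highest F1
```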
Actionable Insights and Business Recommendations¶
Insights
- The pre-pruned model can predict whether a customer will purchase a personal loan, identifying about 91% of actual purchasers on the test set (recall 0.906) with an overall accuracy near 99%.
- The model makes these predictions with few false positives or false negatives (test precision 0.943).
- Key features such as income, average credit card spending, and education drive the predictions, but none is reliable on its own. Targeting on these features alone would produce false negatives, overlooking many customers who would have taken the loan.
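The false-positive and false-negative counts behind these precision/recall figures come straight from the confusion matrix. A tiny illustrative example (synthetic labels, not the notebook's data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic labels just to show how the error counts are extracted
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"false positives={fp}, false negatives={fn}")  # 1 and 1
```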
What recommendations would you suggest to the bank?¶
Business recommendations
- Use the pre-pruned model to identify customers to target with personal loan offers.
- Offers could be delivered online, in branch offices, or through other marketing channels.
- Conduct additional research on whether online marketing, simply advertising the loan's availability, or offering incentives is most effective.
- Record which marketing program led to each personal loan so future campaigns can be evaluated and refined.